Statistical significance and extremal ensemble of gapped local hybrid alignment
Abstract
A “semi-probabilistic” alignment algorithm that combines ideas from Smith-Waterman and probabilistic alignment is proposed and studied in detail. The score statistics of this “hybrid” algorithm are predicted to follow the universal Gumbel form, with the key Gumbel parameter λ taking a fixed asymptotic value for a wide variety of scoring parameters. We also characterize the “extremal ensemble”, i.e., the collection of sequence pairs exhibiting the similarities to which a given scoring system is most sensitive. Based on this extremal ensemble, we give a simple recipe for computing the “relative entropy” and, from it, the correction to λ due to finite sequence length. This allows p-values to be assigned to alignment results for arbitrary scoring parameters and gap costs. The predictions compare well with direct numerical simulations over a broad range of sequence lengths and for various choices of substitution scores and affine gap parameters.
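To illustrate how a Gumbel-form score distribution turns an alignment score into a p-value, here is a minimal sketch. It assumes the standard extreme-value form P(S ≥ s) = 1 − exp(−K·m·n·e^(−λs)) for sequences of lengths m and n. The function name, the prefactor K, and all numerical values are hypothetical and are not taken from the paper, and the finite-length correction to λ described in the abstract is not included.

```python
import math


def gumbel_p_value(score, lam, k, m, n):
    """Return P(S >= score) under an assumed Gumbel-form null distribution.

    score : alignment score
    lam   : Gumbel decay parameter (lambda); must be supplied by the caller
    k     : prefactor K of the Gumbel form (hypothetical value in the example)
    m, n  : lengths of the two aligned sequences
    """
    # Expected number of chance alignments scoring at least `score`.
    expected_hits = k * m * n * math.exp(-lam * score)
    # 1 - exp(-x), computed with expm1 to stay accurate when x is tiny.
    return -math.expm1(-expected_hits)


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    p = gumbel_p_value(score=25.0, lam=1.0, k=0.1, m=300, n=300)
    print(f"p-value: {p:.3e}")
```

For large scores the expected number of chance hits is small and the p-value is approximately equal to it, so an error in λ changes the p-value exponentially in the score. That is why a length-dependent correction to λ matters in practice.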
Similar resources
Rapid Assessment of Extremal Statistics for Gapped Local Alignment
The statistical significance of gapped local alignments is characterized by analyzing the extremal statistics of the scores obtained from the alignment of random amino acid sequences. By identifying a complete set of linked clusters, "islands," we devise a method which accurately predicts the extremal score statistics by using only one to a few pairwise alignments. The success of our method rel...
Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the "local" version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified "semi-probabilistic" alignment consisting of a hybrid of Smith-Waterman and probabilistic alignment is...
Score distributions of gapped multiple sequence alignments down to the low-probability tail.
Assessing the significance of alignment scores of optimally aligned DNA or amino acid sequences can be achieved via the knowledge of the score distribution of random sequences. But this requires obtaining the distribution in the biologically relevant high-scoring region, where the probabilities are exponentially small. For gapless local alignments of infinitely long sequences this distribution ...
Enhancing Parallelism of Pairwise Statistical Significance Estimation for Local Sequence Alignment
Pairwise statistical significance (PSS) has been found to be able to accurately identify related sequences (homology detection), which is a fundamental step in numerous applications relating to sequence analysis. Although more accurate than database statistical significance, it is both computationally intensive and data intensive to construct the empirical score distribution during the estimati...
Random differential inequalities and comparison principles for nonlinear hybrid random differential equations
In this paper, some basic results concerning strict, nonstrict inequalities, local existence theorem and differential inequalities have been proved for an IVP of first order hybrid random differential equations with the linear perturbation of second type. A comparison theorem is proved and applied to prove the uniqueness of random solution for the considered perturbed random differential eq...